Linux is a very useful operating system for the IBM S/390 platform (also referred to as System z) because it consolidates connectivity among legacy, Linux, and middleware applications such as Web, mail, application servers, firewalls, etc. By working as a "native" operating system, Linux leverages all the hardware capability of the S/390 platform.
This article shares five troubleshooting tips to counter the various problems that can arise when you bring up a Linux system on a System z series machine:
- Surviving a Martian invasion: Martians are packets whose source addresses have no known route.
- Disciplining network services on restart: Sometimes when bringing up the Linux image on an LPAR, the service just doesn't behave.
- Intercepting corruption to your file system on shutdown: There are three ways shutdown can lead your file system astray.
- Making the most of cio_ignore: Or, how to shorten your list of "must sense and analyze" boot path devices.
- Keeping Virtual LAN from being a pain in the flash: By looking for the physical installation files.
This article refers to various versions of SUSE Linux, so as a bonus tip, I'll shed some light on a few bugs in different versions of SUSE and give you a workaround for those bugs.
1. Martians, meet our leader
Martians are packets whose source addresses have no known route. These are packets that the Linux operating system isn't expecting, especially considering the interface they arrive on (such as a packet from an internal host coming in through the external interface).
Martian messages may be due to misconfigurations (like an assignment of an external address to a LAN interface), a mask issue, or address spoofing (in which someone is trying to do something awful to your system). The configuration must be checked and corrected to eliminate these kinds of messages.
Sometimes these messages originate at the switch, since many machines terminate on the same switch. In that scenario, the Martian messages may not be malicious, but they can still flood the operating system messages console. Restarting the switch gets rid of them; remember, though, that doing so affects the other machines that terminate on the same switch. And some servers are too important to tamper with in this manner.
Problem: Overwhelmed at the OS messages console
Debugging with the OS messages console can be difficult because you have to wade through the flood of Martian messages (as shown in Figure 1):
Figure 1. A snapshot of Martian messages overloading the messages console
Solution: Stashing the little green men
You can get a cleaner OS console and hide the Martian messages by:
- Turning off Martian packet logging through /proc/sys/net/ipv4/conf/eth0/log_martians. For example:
echo "0" > /proc/sys/net/ipv4/conf/eth0/log_martians
- Dialing down kernel logging to level 4:
yast -> system -> /etc/sysconfig Editor -> System -> Logging -> KERNEL_LOGLEVEL (7 to 4)
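The echo command above handles a single interface. As a minimal sketch, assuming you want the same setting on every interface, the loop below writes 0 to each log_martians entry it finds; it also covers the all and default entries, because the kernel may honor the all entry as well as the per-interface one. The function name disable_martian_logging and its optional directory argument are hypothetical, added only so the loop can be tried against a scratch directory first.

```shell
# Hypothetical helper (not from the article): write 0 to every log_martians
# entry under a conf directory. The directory defaults to the real procfs path.
disable_martian_logging() {
    conf_root="${1:-/proc/sys/net/ipv4/conf}"
    for f in "$conf_root"/*/log_martians; do
        # Skip entries that do not exist or that we may not write (non-root).
        [ -w "$f" ] || continue
        echo 0 > "$f"
    done
}
```

Run as root, disable_martian_logging with no argument acts on the live system. The setting does not survive a reboot. On a running system, root can also lower the console log level immediately with dmesg -n 4.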
2. Bad network services, stand in the corner
When you bring up the Linux image on an LPAR, you may find that the network service does not behave as expected; even though the IP address has been allocated, the network can be unreachable.
Problem: Ping good, login bad
Pinging other machines from the OS messages console can be successful, and the ifconfig command displays the correct information, but when you try to log in (ssh/telnet) through PuTTY, the session is unsuccessful.
Solution: Network restart to the rescue
The root user can restart the network using service network restart on the OS messages console, as shown in Listing 1.
Listing 1. Output of service network restart
j8806_h117 (root): /opt/javabm/data >service network restart
Shutting down network interfaces:
eth0
eth0 configuration: qeth-bus-ccw-0.0.0420 done
Shutting down service network . . . . . . . . . . . . . done
Hint: you may set mandatory devices in /etc/sysconfig/network/config
Setting up network interfaces:
lo
lo IP address: 127.0.0.1/8 done
eth0
eth0 configuration: qeth-bus-ccw-0.0.0420
eth0 IP address: 184.108.40.206/24 done
vlan538
interface is not available
SIOCGIFFLAGS: No such device
Cannot enable interface vlan538.
interface vlan538 is not up failed
Setting up service network . . . . . . . . . . . . . . failed
SuSEfirewall2: Warning: ip6tables does not support state matching. Extended IPv6 support disabled.
SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
SuSEfirewall2: batch committing...
SuSEfirewall2: Firewall rules successfully set
Only the root user can issue this command. If you find that the gateway is incorrect or missing, add the default gateway and then restart the network:
route add default gw 220.127.116.11
service network restart
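Before adding a gateway, it can help to confirm that the default route really is missing. The sketch below is one way to do that, assuming the routing table is read from route -n or ip route output; the function name has_default_route is hypothetical and is not part of any tool mentioned in this article.

```shell
# Hypothetical helper (not from the article): read a routing table on stdin
# and succeed only if it contains a default route. Matches the 0.0.0.0
# destination printed by "route -n" and the "default" keyword of "ip route".
has_default_route() {
    awk '$1 == "0.0.0.0" || $1 == "default" { found = 1 }
         END { exit found ? 0 : 1 }'
}
```

A typical use, with the gateway address taken from the example above, would be: route -n | has_default_route || route add default gw 220.127.116.11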
The output in Listing 1 shows all the steps involved in shutting down and bringing up the network, including the VLAN configuration and firewall setup. If the network doesn't come up, the output does not show the IP address. The IP address is defined in the file /etc/sysconfig/network/ifcfg-eth-id-XX:XX:XX:XX:XX. You can use this output to check each step if something goes wrong with the network.
For updated information in any of the network configuration files (/etc/sysconfig/hardware/ifcfg-eth-id-XX:XX:XX:XX:XX, etc.) to take effect, you must issue the command service network restart.
3. Getting a corruption conviction
Sometimes in the process of shutting down the system, the file system gets corrupted for any of these reasons:
- The Linux image is improperly halted
- Any file system is improperly unmounted
- The root file system is completely, 100 percent full
In these cases, it may not be possible to bring up the same image on the next attempt to boot the system.
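The third cause, a full root file system, is easy to check for before you shut down. The following is a minimal sketch that assumes POSIX df -P output and mount points without spaces; the function name full_filesystems and the default threshold of 95 percent are illustrative choices, not values from this article.

```shell
# Hypothetical helper (not from the article): read "df -P" output on stdin
# and print the mount point of every file system whose usage is at or above
# the given percentage (default 95).
full_filesystems() {
    threshold="${1:-95}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5              # Capacity column, for example "100%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6
    }'
}
```

For example, df -P | full_filesystems 95 lists the file systems that should be cleaned up before the image is halted.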
Problem: To login prompt or not
Let's look at two corruption scenarios:
- When the root file system (/dev/dasda1) itself is corrupted. In this case, the login prompt is not provided by the system on the OS messages console.
- When file systems other than the root file system get corrupted. In this case, the login prompt is provided and only the clean file systems are mounted.
Solution: Enabling fsck whether it wants to or not
The one solution to both scenarios is running fsck, as follows:
- The first scenario can be resolved by bringing the root DASD online (by using the command chccwdev -e <DASD address>) on an existing Linux image and running fsck forcibly. This forces a check of the file system even if it is marked clean. For example: fsck -f /dev/dasdx1 (where x indicates the device letter for the newly added DASD on which fsck is being run).
- In the second scenario, the command fsck is executed on the corrupted file system from the OS messages console after successful login. Once fsck has been run for all the corrupted DASDs, the system needs to be rebooted, and the image should come up successfully on the subsequent reboot.
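The first scenario can be sketched as a short script run on the healthy Linux image. This is only an outline under stated assumptions: the function names run and check_foreign_root are hypothetical, the device node has to be looked up after the DASD comes online (for example with lsdasd), and the final chccwdev -d step, which sets the DASD offline again, is an addition that the steps above do not mention. By default the script only prints the commands; it executes them when DRY_RUN=0 is set.

```shell
# Print each command; execute it only when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Hypothetical helper (not from the article): check a damaged root DASD
# from another Linux image.
check_foreign_root() {
    bus_id="$1"     # DASD address of the damaged root file system
    dev_node="$2"   # device node it receives on this image, e.g. /dev/dasdx1
    run chccwdev -e "$bus_id" || return 1
    # fsck exits non-zero when it corrected errors, so its status is not
    # treated as fatal here.
    run fsck -f "$dev_node"
    run chccwdev -d "$bus_id"
}
```

Calling check_foreign_root with the DASD address and the device node prints the three commands in order, which makes it easy to review the plan before running it for real.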
4. Making faster time on a shorter list
The cio_ignore parameter is a kernel parameter that controls which of the devices attached to the machine Linux senses and analyzes. When Linux starts booting, it senses and analyzes all the available devices. You can use cio_ignore to shorten the list of devices that are to be sensed and analyzed during the boot process.
Problem: It's a long line to get in
On System z, it takes considerable time for the image to come up since there are many devices (DASD, network devices, etc.) attached to the machines that are required to load different images. Regardless of the devices used by any image, the system will sense and analyze each device while booting, making for a slow booting process.
Solution: Comment out some or all the waiters
In such cases, define the range of addresses for the devices that are not necessary for the current image in the zipl.conf file. This way, all the devices defined in zipl.conf are ignored, and the system boots in a zippy manner. You can do this with:
cio_ignore=all: Specifies that all the devices are to be ignored.
cio_ignore=all,!0.0.b100-0.0.b1ff,!0.0.a100: Specifies that all devices but the range 0.0.b100 through 0.0.b1ff and the device 0.0.a100 are to be ignored. (Very customizable!) Write the value without spaces, because a space ends a kernel parameter.
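When an image needs more than a couple of devices, typing the exclusion list by hand invites mistakes. The sketch below assembles the parameter from a list of device IDs or ranges, assuming the "ignore everything except" form shown above; the function name build_cio_ignore is hypothetical.

```shell
# Hypothetical helper (not from the article): build a cio_ignore value that
# ignores every device except the IDs or ranges given as arguments.
build_cio_ignore() {
    param="cio_ignore=all"
    for dev in "$@"; do
        # A leading "!" exempts the device or range from being ignored.
        param="$param,!$dev"
    done
    printf '%s\n' "$param"
}
```

For the example above, build_cio_ignore 0.0.b100-0.0.b1ff 0.0.a100 prints the same string. The result belongs on the kernel parameters line of zipl.conf, and zipl has to be run again before the change takes effect at the next boot. On a running system, cat /proc/cio_ignore shows the ranges currently ignored.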
5. Finding the physical files for your virtual LAN
A virtual LAN (VLAN) is a group of devices on one or more LANs that are configured (using management software) so that they can communicate, thus giving the illusion that they are attached to the same physical wire (when in fact they are located on a number of different LAN segments).
Because they are based on logical, not physical, connections, VLANs are extremely flexible. A VLAN frame looks almost the same as an Ethernet frame; the difference is that a VLAN frame has an extra field containing a number that identifies the VLAN. This number is called a VLAN tag.
Problem: Where's the tag in reality?
A fresh installation of Linux (SUSE or Red Hat) on a System z machine is not possible if the machine is configured for VLAN and has no access to a non-VLAN OSA port (an Ethernet card for System z).
When you do a new Linux (SUSE) installation on an LPAR, you have to install it through the network, so the LPAR (z machine) must be able to reach the server where the installation files are available. The installer asks for various types of information in order to set up the network before starting the installation, but nothing in these prompts lets you set it up to work on a VLAN-configured LAN segment. If the machine (z server) is already VLAN-tagged, then it will look only for packets carrying its specific VLAN tag.
Solution: Install without, then configure for
Neither the SUSE nor the Red Hat distribution supports setting up VLAN on the network interface during installation, so the only solution is to install the system over a working non-VLAN network first and then configure it for VLAN.
You should have non-VLAN-tagged hardware for a new installation. Hardware should not be configured for VLAN at the time of installation.
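Once the system is installed, the VLAN interface can be added on top of the untagged interface. The sketch below shows one way to do it with iproute2, assuming a release whose ip command supports VLAN links (older releases may need different tooling). The function names run and add_vlan_interface are hypothetical, and the VLAN ID 538 in the usage line is taken from the interface name in Listing 1. By default the script only prints the commands; it executes them when DRY_RUN=0 is set.

```shell
# Print each command; execute it only when DRY_RUN=0 is set explicitly.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Hypothetical helper (not from the article): create and activate a tagged
# interface named vlan<ID> on top of an existing untagged interface.
add_vlan_interface() {
    parent="$1"     # interface used for the installation, e.g. eth0
    vlan_id="$2"    # VLAN tag, e.g. 538
    run modprobe 8021q || return 1    # kernel module for 802.1Q tagging
    run ip link add link "$parent" name "vlan$vlan_id" type vlan id "$vlan_id" || return 1
    run ip link set dev "vlan$vlan_id" up
}
```

For example, add_vlan_interface eth0 538 prints the three commands for review. An interface created this way does not persist across reboots; making it permanent depends on the distribution's network configuration files.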
Bonus: Two SUSE bugs highlighted
Finally, here are two SUSE-related bugs you should be aware of.
Bug 1: VLAN tag problem in some SLES9
First up is a bug in the very old GA-level SLES9 kernel: it doesn't come up on VLAN-tagged hardware. It can cause a kernel panic (where the system outputs an error message to the console, dumps an image of kernel memory to disk for post-mortem debugging, and then either waits for the system to be manually rebooted or initiates an automatic reboot). The bug does not occur on the SLES9 SP4 kernel.
Bug 2: An Awk-ward script error in SLES10
Executing an Awk script on SLES10 and SLES10 SP2 results in the following error:
Listing 2. Executing an Awk script on SLES10 and SLES10 SP2
rapdistro7:~ # awk -f sample_test.awk get_build.messages
section 1
*** glibc detected *** awk: double free or corruption (fasttop): 0x0000000080055060 ***
======= Backtrace: =========
......
======= Memory map: ========
......
3ffff952000-3ffff967000 rw-p 3ffff952000 00:00 0 [stack]
Aborted
rapdistro7:~ #
The same script works fine on SLES9. A bug report for this problem has been opened in Bugzilla. The workaround is to run export LC_ALL=C (with no spaces around the equals sign) before executing the script. Versions after SLES10 SP2 have fixed this.
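Exporting LC_ALL changes the locale for the whole session. If that is undesirable, the setting can be limited to the awk invocation itself. The wrapper below is a minimal sketch of that idea; the function name awk_c_locale is hypothetical.

```shell
# Hypothetical wrapper (not from the article): run awk with LC_ALL=C for
# this one invocation only, leaving the session locale untouched.
awk_c_locale() {
    LC_ALL=C awk "$@"
}
```

With this in place, awk_c_locale -f sample_test.awk get_build.messages runs the script from Listing 2 under the C locale.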