A number of LAN-connected systems slow down simultaneously
Some common problems arise in the transition from independent systems to distributed systems.
The problems usually result from the need to get a new configuration running as soon as possible, or from a lack of awareness of the cost of certain functions. In addition to tuning the LAN configuration in terms of maximum transmission units (MTU) and mbufs, look for LAN-specific pathologies or nonoptimal situations that may have evolved through a sequence of individually reasonable decisions.
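As a first check of the MTU settings mentioned above, the MTU column of netstat -i can be compared across the systems on the LAN. A minimal sketch, using an illustrative snapshot in place of real command output:

```shell
# Snapshot standing in for 'netstat -i' output; the interface names
# and values are illustrative, not from a real system.
cat > netstat-i.out <<'EOF'
Name Mtu  Network   Address
en0  1500 10.1.1.0  10.1.1.5
en1  9000 10.1.2.0  10.1.2.5
EOF

# Print the MTU per interface so mismatches across systems stand out.
awk 'NR > 1 { print $1, "MTU", $2 }' netstat-i.out
```

A mismatch between communicating systems (for example, 1500 on one side and 9000 on the other) forces fragmentation and is worth correcting before deeper tuning.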
- Use network statistics to ensure that there are no physical network problems. Ensure that commands such as netstat -v, entstat, tokstat, atmstat, or fddistat do not show excessive errors or collisions on the adapter.
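The counters these commands report can be screened mechanically. A minimal sketch, assuming a capture was saved with something like entstat -d ent0 > entstat.out (the counter values below are illustrative, not from a real adapter):

```shell
# Illustrative stand-in for saved 'entstat' output; the counter names
# match typical fields, but the values are invented for the example.
cat > entstat.out <<'EOF'
Transmit Errors: 0
Receive Errors: 12
Single Collision Count: 3
Multiple Collision Count: 0
EOF

# Flag any non-zero error or collision counter for follow-up.
awk -F': *' '/Errors|Collision/ && $2 > 0 { print "check:", $0 }' entstat.out
```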
- Some types of software or firmware bugs can sporadically saturate the LAN with broadcast or other packets.
When a broadcast storm occurs, even systems that are not actively using the network can be slowed by the incessant interrupts and by the CPU resource consumed in receiving and processing the packets. These problems are better detected and localized with LAN analysis devices than with the normal performance tools.
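In the absence of a LAN analyzer, a crude first indicator is a sudden jump in input packets on an otherwise idle interface between two netstat -i snapshots. A sketch, with illustrative snapshots standing in for real output (the column layout is simplified for the example):

```shell
# Two stand-in snapshots of per-interface counters taken a few seconds
# apart; column 5 is the input-packet counter in this simplified layout.
cat > netstat.t0 <<'EOF'
en0 1500 10.1.1.0 10.1.1.5 1000000 0 800000 0 0
EOF
cat > netstat.t1 <<'EOF'
en0 1500 10.1.1.0 10.1.1.5 1600000 0 800100 0 0
EOF

# Join the snapshots line by line and report a large input-packet delta.
# The 100000-packet threshold is an illustrative assumption.
paste netstat.t0 netstat.t1 | awk '{ delta = $14 - $5;
  if (delta > 100000) print $1 ": " delta " input packets in interval -- investigate" }'
```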
- Do you have two LANs connected through a system?
Using a system as a router consumes large amounts of CPU time to process and copy packets. It is also subject to interference from other work being processed by the system. Dedicated hardware routers and bridges are usually a more cost-effective and robust solution.
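Whether a system is configured to route packets can be checked with the AIX no command (no -o ipforwarding); the sample output below stands in for the real command:

```shell
# Stand-in for 'no -o ipforwarding' output; the value is illustrative.
cat > no.out <<'EOF'
ipforwarding = 1
EOF

# A value of 1 means the system forwards IP packets between interfaces.
awk '/^ipforwarding/ { if ($3 == 1)
  print "system is forwarding IP packets; consider a dedicated router" }' no.out
```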
- Is there a clear purpose for each NFS mount?
At some stages in the development of distributed configurations, NFS mounts are used to give users on new systems access to their home directories on their original systems. This simplifies the initial transition, but imposes a continuing data communication cost. It is not unusual to find users on system A working primarily with data on system B, and vice versa.
Access to files through NFS imposes a considerable cost in LAN traffic, client and server CPU time, and end-user response time. A general guideline is that user and data should normally be on the same system. The exceptions are those situations in which an overriding concern justifies the extra expense and time of remote data. Some examples are a need to centralize data for more reliable backup and control, or a need to ensure that all users are working with the most current version of a program.
If these and other needs dictate a significant level of NFS client-server interchange, it is better to dedicate a system to the role of server than to have a number of systems that are part-server, part-client.
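The cost of NFS interchange can be gauged from the client retransmission rate reported by nfsstat -c. A sketch using illustrative counters; a rate above a few percent usually points at server or network overload:

```shell
# Stand-in for the calls/retrans counters from 'nfsstat -c' output;
# the values are invented for the example.
cat > nfsstat.out <<'EOF'
calls retrans
120000 3600
EOF

# Retransmission rate as a percentage of total client RPC calls.
awk 'NR==2 { printf "retrans rate: %.1f%%\n", 100 * $2 / $1 }' nfsstat.out
```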
- Have programs been ported correctly and justifiably to use remote procedure calls (RPCs)?
The simplest method of porting a program into a distributed environment is to replace program calls with RPCs on a 1:1 basis. Unfortunately, the disparity in performance between local program calls and RPCs is even greater than the disparity between local disk I/O and NFS I/O. Assuming that the RPCs are really necessary, they should be batched whenever possible.
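The effect of batching can be sketched with a shell function standing in for an RPC, where each invocation counts as one network round trip (the rpc function and its workload are hypothetical):

```shell
# Hypothetical stand-in for an RPC: each call simulates one network
# round trip, so fewer calls means fewer round trips.
round_trips=0
rpc() { round_trips=$((round_trips + 1)); echo "server processed: $*"; }

# Unbatched 1:1 port: one RPC per item, three round trips.
for item in a b c; do rpc "process $item"; done

# Batched: all items carried in a single RPC, one round trip.
rpc "process a" "process b" "process c"

echo "total round trips: $round_trips"
```

The work done is the same in both cases; only the number of network round trips, which dominates RPC cost, differs.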